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ABSTRACT 

Motivation: Progress in protein biology depends on the reliability of re- 
sults from a handful of computational techniques, structural alignments 
being one. Recent reviews have highlighted substantial inconsistencies 
and differences between alignment results generated by the ever-grow- 
ing stock of structural alignment programs. The lack of consensus on how 
the quality of structural alignments must be assessed has been identified 
as the main cause for the observed differences. Current methods assess 
structural alignment quality by constructing a scoring function that at- 
tempts to balance conflicting criteria, mainly alignment coverage and fi- 
delity of structures under superposition. This traditional approach to 
measuring alignment quality, the subject of considerable literature, has 
failed to solve the problem. Further development along the same lines is 
unlikely to rectify the current deficiencies in the field. 
Results: This paper proposes a new statistical framework to assess 
structural alignment quality and significance based on lossless informa- 
tion compression. This is a radical departure from the traditional ap- 
proach of formulating scoring functions. It links the structural alignment 
problem to the general class of statistical inductive inference problems, 
solved using the information-theoretic criterion of minimum message 
length. Based on this, we developed an efficient and reliable measure of 
structural alignment quality, /-value. The performance of /-value is 
demonstrated in comparison with a number of popular scoring func- 
tions, on a large collection of competing alignments. Our analysis 
shows that /-value provides a rigorous and reliable quantification of 
structural alignment quality, addressing a major gap in the field. 
Avai labi lity : http://lcb. i nfotech . monash .edu .au/l -val ue 
Contact: arun.konagurthu@monash.edu 

Supplementary information: Online supplementary data are available 
at http://lcb.infotech.monash.edu.au/l-value/suppl.html 

1 INTRODUCTION 

A protein structural alignment is an assignment of residue-resi- 
due correspondences between the amino acids of two or more 
proteins, based on their 3D structure. Protein structural align- 
ments support basic and applied research in molecular biology. 
For example, they reveal how protein families evolve, identify 
patterns of conservation in amino acid sequences that fold into 
similar structures, facilitate comparative modelling of structures 
from sequence and guide experimental solutions to structures 
using crystallographic molecular replacement (Konagurthu 
et al, 2006). 

The last four decades have seen the development of many 
methods aimed at generating biologically meaningful structural 
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alignments. While the number of new methods is estimated to be 
doubling roughly every five years (Hasegawa and Holm, 2009), 
several comparative studies have observed many inconsistencies 
and paradoxes when comparing the alignments generated by 
existing methods. Noteworthy among these studies are those 
by Michael Levitt (Kolodny et al, 2005), Liisa Holm 
(Hasegawa and Holm, 2009) and Manfred Sippl (Sippl and 
Wiederstein, 2008; Slater et al, 2013) and colleagues. 
A common theme emerging from all these studies is the need 
for a systematic framework to asses the quality of structural 
alignments. While a handful of quantitatively rigorous statistical 
models for structure comparison have been proposed for this, 
there is no consensus regarding their usefulness. 

This is in stark contrast to the state-of-the-art in the closely 
related problem of aligning protein sequences, where many rigor- 
ous statistical models have been proposed to quantitatively assess 
sequence alignment quality (Allison et al, 1992; Altschul, 1991; 
Karlin and Altschul, 1990). This has, in turn, helped standardize 
the task of measuring sequence alignment quality and, thus, the 
task of generating meaningful sequence alignments. 

In this work, we begin by examining the foundations of how 
structural alignments are currently assessed. Guided by good 
biological insights, current structural aligners use a scoring func- 
tion to quantify the structural alignment quality. This has trad- 
itionally been achieved by combining the contributions of a small 
number of important criteria into an easy-to-compute scoring 
function. [For a comprehensive list of commonly used scoring 
functions, see Hasegawa and Holm (2009)]. 

Overwhelmingly, the two key criteria that various current 
measures use are coverage and fidelity. Typically, coverage meas- 
ures the number of correspondences (or equivalences) in an 
alignment and, in some cases, also considers the number of 
gaps. Fidelity, measures how similarly positioned the aligned 
residues are. This is commonly (but not always) based on the 
root-mean- square deviation (RMSD) computed after the best 
rigid-body transformation of corresponding residues is found. 

To search for the best structural alignment, the goal of the 
aligners is to simultaneously maximize coverage and fidelity. 
However, these two objectives are in direct conflict with each 
other. We observe that most of the current proliferation of struc- 
tural alignment scoring functions arise from attempts to reconcile 
this conflict, that is, existing scoring functions differ mainly in 
how they combine these two criteria. As the reviews show, exist- 
ing scoring functions do not generate consistent results, even 
when aligning structures that have only moderately diverged in 
evolution (Hasegawa and Holm, 2009; Kolodny et al, 2005; 
Slater et al, 2013). 
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Because this traditional approach of formulating a scoring 
function has been explored extensively over the last four decades, 
further development along the same lines is unlikely to provide 
any major breakthrough. Therefore, this field will stand to bene- 
fit by departing from the traditional approaches and exploring 
radically new ones. This paper is a step in this direction. 

Structural alignment as an inductive inference problem. The 
goal of inductive inference is to propose a theory (or hypothesis) 
that is able to best explain the observed data. Structural align- 
ment can thus be seen an instance of the general class of inference 
problems. In this context, an alignment (i.e. residue-residue cor- 
respondence) is a hypothesis that attempts to explain the residue- 
residue relationships between two protein structures, whose 
observed data is the (x, y, z) coordinates of the structures. 

In general, any hypothesis has a certain (descriptive) complex- 
ity. A complex hypothesis with more free parameters can predict 
(or fit, explain) a greater variety of observed data than a simpler 
hypothesis. Therefore, in order to choose the best hypothesis for 
any inference problem, one is confronted with a trade-off 
between hypothesis complexity and its fit with the observations. 

For structural alignments, this trade-off is related to the con- 
flict between coverage and fidelity. Coverage (in various forms 
handled in the current scoring functions) is a crude approxima- 
tion of the (alignment) hypothesis complexity. Similarly, the fi- 
delity (or goodness of fit with the observed data) of a structural 
alignment is approximated using RMSD of superposition or 
using some distance measure. These rudimentary approxima- 
tions cause the existing scoring functions to introduce several 
tunable parameters in an attempt to balance the contributions 
between coverage and fidelity of structural alignments. This has 
been a major source of the inconsistencies observed in 
alignments. 

The field of statistical learning and inference provides rigorous 
approaches to address this trade-off systematically. In the early 
1960s, several landmark papers proposed links between inductive 
inference and information theory (Kolmogorov, 1965; 
Solomonoff, 1960; Wallace and Boulton, 1968). The Minimum 
Message Length (MML) principle (Wallace and Boulton, 1968) 
provided the first practical information-theoretic criterion for 
hypothesis selection based on observations. It is used here to 
rigorously assess structural alignment quality and reliably differ- 
entiate between competing alignments. 

Structural alignment quality and lossless information compres- 
sion. The pioneering work of Claude E. Shannon (Shannon, 
1948) provides the means to quantify information: the length 
of the shortest code required to transmit, losslessly, an observed 
event. This can be understood as the length of the shortest mes- 
sage needed to communicate the event losslessly between an im- 
aginary sender (Alice) and receiver (Bob). 

In this context, the structural alignment problem can be 
rationalized as a communication process between Alice and 
Bob, where Alice has access to the (x, y, z) coordinates of two 
protein structures and she wants to encode and transmit this 
information to Bob losslessly. Two possible scenarios then 
arise: (i) If the two are unrelated to each other structurally, 
Alice cannot do better than to encode and transmit the informa- 
tion of the two structures independently, one after another. That 
is, knowledge of one structure (called reference, or S) does not 
provide information about the other (called target, or T) and, 



thus, knowledge of S cannot be used to compress T. This form of 
independent transmission is termed here as the null model mes- 
sage, (ii) On the other hand, if the two structures are structurally 
related (i.e. there is a meaningful alignment between the two), 
knowledge of S reveals information about T. The more similar 
the structures, the more information one reveals about the other. 
Alice can use this similarity to compress and transmit the infor- 
mation of the target structure using the information of the ref- 
erence. For Bob to decode the information of the target losslessly 
(i.e. to the precision with which Alice sees it), he will require the 
structural information of the reference structure plus the infor- 
mation of its proposed relationship (i.e. the structural alignment) 
with the target. This will allow Alice to encode the target more 
concisely than stating the target structure using a null model. We 
call this form of transmission, the alignment model message (to 
contrast it with the null model message, where the structures are 
transmitted independently). 

We note that this information-theoretic framework for struc- 
tural alignment is intuitive. If the proposed alignment relation- 
ship is a poor one, then the encoded alignment model message 
will be inefficient (i.e. long). Alternatively, if the alignment rela- 
tionship is a good one, then the transmission of the target be- 
comes efficient (i.e. short). Therefore, the total message length of 
the lossless transmission of coordinate information (using an 
alignment hypothesis) forms an excellent measure to assess struc- 
tural alignment quality. It follows that the best alignment is the 
one with the shortest total message length of lossless transmission. 

While we have intuitively rationalized this framework as a 
communication process, this message paradigm is also backed 
by mathematical rigour. Formally, let A denote some alignment 
between structural coordinates S and T. Using the product rule 
of probability over three events A, S and T we have: 

P(A&S&T) = P(A) x P(S\A) x P(T\S&A) 

(1) 

= P(A) x P(S) x P(T\S&A) 

where P(A&S&T) gives the joint probability of alignment A for 
structures S and T, P(A) the prior probability of the alignment, 
P(T\S&A) the likelihood of T given S and A. Note, P(S\A) is 
P(S) because S and A are assumed to be independent. 

Shannon's mathematical theory of communication (Shannon, 
1948) gives the relationship between the shortest message length 
1(E) to communicate losslessly any observation E, and its prob- 
ability P(E) as 1(E) = - log (P(E)). Technically, 1(E) denotes the 
Shannon information content of E. 

Restating equation 1 in terms of information content, we 
obtain: 

I(A&lSSlT) = 1(A) + I(S) + I(T\S&A) (2) 

where transmitting the information of the reference structure S 
takes I(S) bits, transmitting the alignment information takes 1(A) 
bits and transmitting the information of the target structure T 
using A and S takes I(T\S&A) bits. 

Our message length measure has the following three key prop- 
erties, which are not achieved by previous scoring functions: 

(1) The difference between the lengths of the messages needed 
to transmit the structures S and T using any two align- 
ments, gives their log-odds posterior ratio. 
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/(A & S & 7)-^ 2& 5 & 7)^log(^|fg) 

(3) 

/ P(S&T)P(A 2 \S&T) \ = / P(A 2 \S&T) \ 
° S \P(S &T)P(Ai \S&T)J ° g \P(Ai \S&T)J 

As a result, any two competing alignment hypotheses A\ 
and A 2 can now be compared based on their message 
lengths. Therefore, the best alignment hypothesis A* is 
the one that results in the shortest message length value 
of I(A*&S&T). 

(2) Our measure permits a natural null hypothesis test where 
the statistical significance of any proposed alignment hy- 
pothesis can be estimated. Any alignment hypothesis A 
whose message length I(A&S&T) is worse (longer) than 
that of the null model message, I(S&T) = I(S) + 7(7), must 
be rejected. 

(3) This measure provides an objective, formal trade-off be- 
tween the complexity of the alignment (1(A)) and the fi- 
delity of the structures given the proposed alignment 
(I(T\S&A)). Unlike previous attempts, these terms are 
not ad hoc approximations, as they represent rigorous es- 
timations of Shannon information content based on loss- 
less encoding and compression. 



2 METHODS 

2.1 Computation of the null model message length 

The null model message corresponds to the transmission of protein co- 
ordinates without an alignment hypothesis. (In this work, we consider 
only the C a coordinates.) We have previously defined [for a completely 
different problem (Konagurthu et al., 2012)] a null model encoding of 
coordinates along a protein chain. We will briefly summarize this ap- 
proach, as elements of this encoding are used and developed further in 
our current work. 

The null model encoding relies on the observation that the distance 
between successive C a atoms in a protein chain is highly constrained to 
3.8 ± 0.2 (s.d.) A. For a chain of coordinates {p\,p 2 , ■ ■ ■ ,p n }, any coord- 
inate p i+ i can be transmitted given the previous p h by first transmitting 
the distance r,- between p, and p i+ \ using a normal distribution J\f(r; /x, a) 
stated to 6 = 0.001 A accuracy, with /x = 3.8 A and o = ± 0.2 A. (e = 0.001 
reflects the precision of statement of coordinate data, which is three places 
after the decimal point as reported in the protein data bank.). We repre- 
sent the length of this encoding as 7(r,-). With this information trans- 
mitted, Bob now knows that p i+ \ lies on a sphere of radius r { centred 
at pi, but does not yet know where exactly on the sphere it is. Assuming 
that pi+i is distributed uniformly over the surface of the sphere, trans- 
mitter Alice can discretize the sphere's surface into cells, each of area e 2 . 
Using this discretization, p i+ \ can be transmitted as cell number c i+ \. 
The numbering convention of the discretization is in the shared 
codebook. With the knowledge of p b r t and c b Bob can 
reconstruct p i+ \ to the stated accuracy. Stating the cell number takes 

I(ci) = - log 2 (4^2) = log 2(4777f ) - 21og 2 e bits. 

When sending a chain of C a coordinates {p\,p2, • • • ,p n ) over a null 
model message, we assume that p\ is the origin and, hence, does not need 
to be transmitted as part of the message. Even if p x is not assumed to be 
the origin, its encoding will add a fixed one-time cost to the message 
length. Thus, to transmit the chain of points p\, P2, . . . , p n over the null 
model, Alice needs to send to Bob the number n of C a atoms in the chain, 
followed by incrementally transmitting (using the method above) p 2 given 



Pi, p 3 given p 2 and so on, until all coordinates are transmitted. Alice can 
transmit the number n over a integer distribution. Wallace and Patrick 
(1993) gave an efficient code (/integer (n)) for transmitting any positive in- 
teger n > 0. 

Therefore, the total message length required to send all the coord- 
inates over the null model message takes I n u\\ip\, ■ ■ ■ , ^2) = integer W + 

^r\(I(ri) + I( Cl )) bits. 

Using the above, the coordinates of structures S= {S\, S2, . . . , S\s\) 
and T={T\, T2, . . . , T\ T \) are sent as independent chains of coordinates 
over the null model, taking / nu ii(5 , &J) = / nu n(5') + / nu u(J) bits. 

2.2 Computation of the alignment model message length 

Equation 2 gives the amount of information required to transmit the 
coordinates of structures S and T using the alignment hypothesis A. 
To estimate this, the explanation message involves transmitting: the 
coordinates of S, the residue-residue correspondences proposed by 
the alignment in A and, finally, the coordinates of T using the informa- 
tion of S and A. 

2.2.1 Transmitting the coordinates of S This is achieved by sending 
the coordinates of S= {Si, S 2 , ■ ■ ■ , S\s\} over the null model. Therefore, 
I(S) = I nu]1 (S l ,S 2 , ...,%) bits. 

2.2.2 Transmitting the correspondences in A Any alignment can be 
described as a string switching between three states: match ('m'), insertion 
(T) and deletion ('d') states. This alignment string can be transmitted 
losslessly using a first-order Markov model. 

To transmit an alignment over a 3-state Markov chain, we use an 
approach similar to the adaptive encoding method used by Wallace and 
Boulton (1969) over a multinomial (n-state) distribution. The adaptive 
encoding here requires maintaining nine running counters, one for each 
possible transition probability, all initialized to 1. Traversing the align- 
ment string left to right, for every observed transition, Alice estimates its 
probability by dividing the current value of the corresponding transition 
counter by the sum of all counters from previous to any state. After the 
probability is estimated, Alice encodes the current alignment state using 
this probability and then increments the corresponding counter by 1 . The 
code length to encode each state is the negative logarithm of its estimated 
probability. Summing each transition over the entire alignment gives the 
code length, 1(A). 

2.2.3 Transmitting the coordinates of T given S and A With 
the information of S and A known to Bob, Alice can now use that in- 
formation to encode the coordinate information of T. Intuitively, our 
encoding is based on the fact that when scanning A from left to right, 
Alice views T as runs of coordinates that alternate between blocks of 
insertions and matches with respect to S and the stated alignment. 
Note that all deletion blocks (with respect to S) in T are ignored as 
they contain no information to be transmitted about T. More formally, 
let A yield {1\ , . . . , l m } insertion blocks, where any represents a con- 
secutive stretch of coordinates that are inserted in T (with respect to S). 
Each insertion block is transmitted as a null message taking 
I ms (T\S&A) = Y2= 1 Jnunff Obits. 

What remains to be sent to Bob are the coordinates in T aligned to 
corresponding coordinates in S, that is, the matches. Let {S^ , S i2 , . . . , S in } 
and {7}j , 7} 2 , . . . , 7}J where 1 < i\ < . . . i n < \S\ and 
1 < j\ < • • Jn <\T\, denote the ordered set of corresponding coordinates 
in S and T, respectively. Bob already knows S and the alignment. From 
the alignment information he can infer the indexes of the aligned residue- 
residue correspondences: (i\, j\), (i 2 , j 2 ), ... , (i n , j n ) between S and T. 
Thus, Alice can use the following procedure to transmit the aligned co- 
ordinates in T. To start the procedure, the first three matched coordinates 
of {7} 1? Tj 2 , 7} 3 } are sent over the null model message taking: I & tartup(T\S 
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&^l) = / nu ii(7} 1 , Tj 2 , 7} 3 )bits. Alice then incrementally sends the remaining 
aligned coordinates of T as follows. To transmit the current aligned co- 
ordinate Tj k+l , Alice considers only the set of (previous plus current) 
aligned coordinates {T) 1? 7) 2 , T jh , T jk+l }. This set is orthogonally 
transformed to the set {Tj l , Tj 2 , . . . , Tj k , Tj k+l }, such that it minimizes 
the least-square error between {S h , ... , S ik } and {T h , ... , T jk }. Using 
this setup, Alice can transmit Tj k+l over a directional distribution on a 
sphere. This is achieved by first transmitting the radius r k = \ \ T Jk+l — T jk \ \ 
over a normal distribution with the same procedure described for formu- 
lating a null model message. This allows Alice to state 7) A + 1 as a point on 
a sphere with radius r k centred at Tj k . However, we do not state it over a 
uniform distribution (which would make it a null model description), as 
the knowledge of correspondence of Tj k+1 with Si k+1 gives clues about its 
position on the sphere (provided the assigned correspondence is a 'good' 
one). Because Bob already knows the corresponding point S ik+l , after 
transmitting r k , Alice can use a directional probability distribution to 
state Tj k+1 more concisely. In directional statistics, the von Mises- 
Fisher distribution gives the probability density function (PDF) on the 
surface of any sphere in ^-dimensions. In three dimensions, the PDF on 
the surface of a unit sphere is (Fisher, 1953; Mardia and Jupp, 1999): 

Using this distribution to transmit Tj k+l , we compute x^+i as the 
direction cosines of the vector Tj k+l — Tj k , and Afe+i as the direction 
cosines of the vector S ik+l — T jk . The probability of stating T jk+1 to the 
required precision (that is, 6 = 0.001 A precision on each component) 
using von Mises-Fisher distribution over the surface of a 3D sphere of 

-e*£-* where e' 2 = ac- 



unit radius is then given by: P(x) = e / 2^(e K - 
counting for the scaling of the sphere of radius r t to a unit sphere. 
Transmission of each T jk+l requires the concentration parameter k. We 
use the maximum-likelihood estimator based on the available superpos- 
ition [see Mardia and Jupp (1999)]. Therefore, the code length to state x 
using von Mises-Fisher is: / V mf(-£) = - log (P(x)) bits. 

Each Tj k+X is transmitted iteratively over this procedure, which we term 
adaptive superposition. Thus, the message length required to transmit the 
matched points in T with respect to their corresponding point in S is 

/match(^l^&^) = /startu P (^l^&^) + Xl"=4 /vmf ^') bits - Combining the 
message lengths of transmitting coordinates in the insertion and matched 
blocks gives I(T\S&A) = I im (T\S&A) + I matc h(T\S&A) bits. An illustra- 
tion of this procedure is shown in Figure 1. 

2.3 Measure of alignment quality 

I(A&S&T) is used as the measure of alignment quality. We call this 
measure /-value, indicating a value measuring the information content 
in the structural coordinates of S and T, given the structural alignment A 
as a model of compression. The smaller the /-value, the better the align- 
ment. It follows that for competing alignments A\ and A2, if the /-value 
of A\ is smaller than that of Ai by, for example, 15 bits, then A\ is 
2 15 times more likely than A2 (see property 1 of this measure shown 
in Equation 3). Further, any alignment A for which I(S&T&A)> 
I nu \\(S&T) = / nu ii(5) + / nu u(7) can be rejected (see property 2 of this meas- 
ure under 'Structural alignment quality and lossless information 
compression'). 

Handling shifts and rotations. So far we estimated the /-value under the 
rigid model of structural alignment. This model can be generalised to 
handle plastic deformations commonly observed in protein evolution, 
such as hinge rotations and shifts. Handling these deformations requires 
a modification in the way I(T\S&A) is estimated under a flexible model of 
transmission. 

Without loss of generality, assume that T contains a certain number of 
shifts and rotations, with respect to S, associated with its residues. In 
computing I(T\S&A), alignment A is partitioned at the residues in T 
about which the shifts and rotations are defined. For example, consider 
below an alignment containing a hinge rotation about residue 10 of J 




Fig. 1. An idealized example of the adaptive superposition used to send the 
matched residues in T (in blue) incrementally given the knowledge of S (in 
black). Both structures have 8 points and are assumed here to be in one-to- 
one correspondence. Assume that Bob already knows the first 3 points of 
T. Alice sends the fourth point in T by superposing all previously matched 
points between the two structures. (Green crosshairs shows the rotational 
centre of superposition.) This orients the fourth point (in red) in T [or, 
more generally, T Jk+l , whose deviation from its corresponding S ik+l can be 
encoded over a von Mises-Fisher spherical distribution (see main text)] 



(marked by *). Then, the alignment can be partitioned into two separate 
parts as follows: 

* 1 1 
123 4567890123 

S — XXX XXXXXXXXXX 

T XXXXXXXXXX XXXXX 

1234567890 12345 
1 



123 45 567890123 

— XXX XX XXXXXXXXX 

XXXXXXXXXX + X — XXXXX 
1234567890 0 12345 
1 1 



Let these partial alignments be denoted as A(T\, . . . , T\ 0 ) and 
A(T\o, . . . , 7^15), identifying the start and end residue indexes in T 
about which the partition is defined. Then I(T\S&A) is computed as 
I(T\S&A(Ti Tio)) + I(T\S&A(T l0 , T l5 )) using the procedure 
described earlier. More generally, if there are k residues in T about 
which shifts/rotations are defined, the full alignment A is parti- 
tioned into k + 1 partial alignments: A(T\, . . . , T h ), A(T h , . . . , T h ), . . . , 
A(T ik , ... ,T\ T \), where i\ < 12 < . . . < 4 < I T\ . Given these partitions, 
I(T\S&A) can be computed as I(T\S&A(T U ... , 7^)) + • • • +I(T\S& 
A(ik, ■ ■ ■ , 1 71)). This immediately poses another inference question: 
Given an alignment A of S and T, how many shifted/rotated residues 
does it contain? We note that adding a shift/hinge has an overhead 
which must pay for itself with a better fit if it is to be accepted. 

Inference of shifted/rotated residues: A dynamic programming algo- 
rithm is used to optimally partition A minimizing I(T\S&A). 

The algorithm first constructs a matrix M of size \T\ x \T\ such that 
each cell M(iJ) (1 < i<j < \T\) stores the value I(T\S&A(i, ...J)). The 
best partition of A is then computed using the following dynamic pro- 
gramming recurrence relationship: 



V{\ j) = 



M(\,j), 

V(l,...,i) + M(iJ) 



vi <;< m 



(4) 



where any V(l, . . . , i) gives the optimal partitioning up to the /th residue 
in T, 1 < i < \T\. At the end of this procedure the value V(l,...,\T\) 
gives the component message length I(T\S&A) of Equation 2, in a way 
that handles shift and hinge rotations. 



2.4 The time complexity of computing /-value 

Using the rigid model of transmission (i.e.without handling the hinge- 
rotations and shift), the computation of I(A&S&T) is linear in the size of 
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DALI Scoring Function TM-Score /-value Compression over Null model (in bits) 



DAL I 



TM-Align 




Fig. 2. Table comparing the value of the DALI, TM-Score and /-value scoring functions (Columns) over 5 SCOP groups (see main text) containing 500 
alignments each, generated by (rows) DALI and TM-Align programs. Note that the Y-axis uses different scales, as the range of values differ between 
scoring functions. Therefore, the absolute heights of the boxes cannot be compared between the box- whisker plots. However, their performance can be 
compared by the relative overlaps of the various quartile levels for each group with respect to others within the same box- whisker plot 



the alignment, as the computation of I(S), 1(A) and I(T\S&A) are all 
linear. While the linearity of the first two is clear, that of I(T\S&A) is not, 
as it requires repeated adaptive superpositions. However, we have 
recently proved sufficient statistics for the orthogonal superposition prob- 
lem that allows each updated superposition to be computed as a constant- 
time update over the previous ones (Konagurthu et ah, 2014), making the 
computation of I(T\S&A), and /-value under rigid superposition, linear. 
On the other hand, using the flexible model which allows for hinge rota- 
tions and shifts, the computation of I(A8lSSlT) is quadratic, as it is 
dictated by the complexity of the dynamic program given by Equation 4. 

3 RESULTS AND DISCUSSION 

We have compared the quality of our /-value measure (using the 
flexible model described in sections 2.3) with popular scoring 
functions DALI, TM-Score, MI, SI, STRUCTAL, LGA_S3, 
GDT_TS, SAS and GSAS, using a large data set of alignments 
produced by the popular structural alignment methods DALI, 
TM-Align, LGA, CE and FATCAT. [Refer to Hasegawa and 
Holm (2009) for references of these scores/aligners.] Due to lack 
of space, we restrict our results herein to those obtained when 
comparing the DALI Score, TM-Score, and /-value measures, 
using TM-Align and DALI as the structural alignment gener- 
ators. We refer to our online supplementary material for the 
remaining results. 

Our first experiment tests the ability of the scoring functions to 
differentiate between pairs of structural domains that vary along 
the hierarchical groups defined by SCOP (Lo Conte et al, 
2000) — Class, Fold, Superfamily and Family. To do this, we 
randomly selected a set of 500 'pivot' domains from SCOP 
and, for each of these pivots, we randomly selected five other 
domains whose relationship with the pivot varies progressively: 
(i) Same-Family, (ii) Same- Superfamily (but not Family), (iii) 
Same-Fold (but not Superfamily and Family), (iv) Same-Class 
(but different below this level) and (v) Decoy (or different-Class). 



This results in a collection of 500 x 5 SCOP domains. We then 
aligned each pivot with each of its five counterparts, generating a 
total of 2500 alignments per alignment program. Finally, we as- 
sessed all these alignments using the selected scoring functions. 

Figure 2 shows (some of) the box-whisker plots resulting from 
these comparisons. (As mentioned earlier, more results are given 
in Supplementary Table SI of the online supplementary mater- 
ial.) Rows in this table denote the alignment method (DALI/ 
TM-Align) used to generate the 2500 alignments in our collec- 
tion. Columns denote the scoring function (DALI, TM-Score, I- 
value) used to compute the alignment score. Each cell in the table 
is a box-whisker plot that displays the numerical scores (as quar- 
tile marks) produced by each [alignment method, scoring func- 
tion] pair, over the five groups of (500) alignments each. Note 
that for /-value, we show the compression gained (in bits) over 
the null model message length, that is, the (Null-/- value) message 
lengths. Thus, the greater the compression, the better the align- 
ment. (In contrast, when using raw /-values rather than compres- 
sion with respect to Null, the smaller the /-value the better the 
alignment.) 

A cursory inspection of these box- whisker plots indicates that, 
for the given alignments (which might not be the best/optimal 
ones), all scoring functions consistently differentiate between 
the SCOP groups to some extent. However, none of the scoring 
functions can be said to separate the SCOP groups cleanly nor to 
be clearly better than the others. This reflects partly the fuzzy 
classification boundaries of SCOP, and partly the quality of the 
(sub-optimal) alignments of domain pairs generated by popular 
alignment methods. For example, for TM-Score it has been 
claimed that the numerical score of<0.5 corresponds to align- 
ments not being in the same fold. However, we observe from the 
box-whisker plots in the second column of the table (the Fold 
group), that >3 quartiles of the alignments have a TM-score of 
<0.5. Inspecting the box-whisker plots at the same-Fold group, 
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Table 1. Comparison between DALI score, TM-Score and /-value on ambiguous alignments reported by Zu-Kang and Sippl (1996) 



Structures 
Sv.T 


Residues 




Alignment Ai 






Alignment Aj 




\s\ 


m 


DALI Score 


TM-Score 


I(Ai,S,T) 


Dali Score 


TM-Score 


I(A 2 , S, T) 


2TMVP v. 256BA 


154 


106 


262.8 


0.4871 


9611.2 bits 


242.6 


0.4744 


9614.0 bits 


1TNFA v. 1BMV1 


152 


185 


265.1 


0.3815 


12577.1 bits 


307.3 


0.3947 


12463.7 bits 


1UBQ v. 1FRD 


76 


98 


161.9 


0.4790 


6384.8 bits 


146.1 


0.4518 


6409.8 bits 


2RSLC v. 3CHY 


119 


128 


182.6 


0.3773 


9159.8 bits 


206.1 


0.3768 


9143.5 bits 


3CHY v. 1RCF 


128 


169 


377.5 


0.4960 


10983.0 bits 


336.4 


0.4855 


10961.8 bits 



Note: For DALI and TM-score, the higher the score the better the alignment. For I-value, the smaller the value (or message length), the better is the alignment. Bold numbers 
in each row indicate the better of the two competing alignments under each of the scoring measures. 
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RMSD (Angstroms) 

Fig. 3. Sieving of the two ambiguous alignments reported by Zu-Kang and Sippl (1996) for the pair 1RCF and 3CHY. The X-axis gives the RMSD of 
each sieved alignment, while the Y-axis gives the scoring function used: (a) DALI, (b) TM-Score and (c) /-value. The labels in the figures correspond to 
the NEquiv values during sieving. For the TM-Score plot (b), the horizontal dotted line is the threshold for fold-level relationship. For the /-value plot 
(c), the horizontal line corresponds to the NULL model message length. The sieved alignments below this line are statistically significant 



shows that a very large majority of alignments produced by 
DALI and TM-Align and scored using /-value are seen to be 
statistically significant (i.e. with Null-/- value message length >0), 
with DALI alignments being more reliable than those generated 
by TM-Align, judging by the compression gain at the level of 
their respective medians. Interestingly, /-value seems to provide 
the smallest of variations when comparing the results obtained 



for the alignments generated by DALI to those generated by 
TM-Align. 

Upon closer inspection, the results of our experiment indicate 
a significant degree of disagreement between the respective 
scores: only in 28% of the 2500 pivot- versus-counterpart align- 
ments, all three scores agree on whether the alignment produced 
by DALI or by TM-Align is the best. This disagreement 
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highlights the need for a generally accepted, rigorous alignment 
score. Supplementary Table S2 in the online supplementary ma- 
terial shows the full list of disagreeing pairs. 

Our second experiment uses the set of ambiguous alignments 
described by Zu-Kang and Sippl (1996), as case studies for the 
various scoring functions ability to differentiate between very clo- 
sely competing alignments. These alignments are indistinguish- 
able in RMSD and number of equivalences (NEquiv). Table 1 
compares the three scoring functions across five pairs of ambigu- 
ous structural alignments. For each scoring function, the better 
score of the two alignments for each pair is highlighted in bold. We 
again observe disagreement in two out of the five pairs between 
the scoring functions in their ability to decide which of the two 
alignments is the best. We emphasize that this discrimination is 
crucial for any scoring function to be useful as the basis of a search 
method looking for the optimal alignment. 

Let us illustrate the problems in using some of these scoring func- 
tions for an optimal alignment search method, using the case study of 
pair 3CHY v . 1RCF. To do so we 'sieve' each of the two ambiguous 
alignments using the following procedure, similar to the one 
described in Irving et al. (2001), to generate a number of competing 
alignments at varying levels of NEquiv and RMSD: (i) Compute the 
[NEquiv, RMSD] and corresponding DALI, TM-Score and /-value 
of the current alignment, (ii) Delete the worst-fitting aligned pair that 
appears at the end of a 'matched block' in the alignment, and force 
the deleted pair of residues to be unaligned, (iii) Repeat from Step 1 
until the RMSD falls below a threshold value. 

Studying the three sieving plots corresponding to DALI, TM- 
Score and /-value in Figure 3, it is immediately clear that TM- 
Score (Fig. 3b) does not produce any clear optima for this set of 
competing alignments to choose from. In fact, TM-Score mono- 
tonically increases towards the TM-scores of the unsieved align- 
ments. On the other hand, /-value and the DALI score produce 
clear, though conflicting optima, /-value points to an optima for 
the sieved Alignment 1 at [NEquiv, RMSD] of [101,3.7 A], 
whereas DALI points at [94, 3. 17 A]. (Superpositions based on 
these two alignments can be found in the online supplementary 
material.) This reflects the standard dilemma human experts and 
scoring functions face in choosing between two conflicting ob- 
jectives. However, /-value objectively discriminates between the 
two competing alignments in terms of the lossless compression 
achieved. We emphasize that compression takes into account all 
aspects of the trade-off — descriptive complexity of the alignment 
hypothesis versus the quality of fit of the alignment to the struc- 
tural data — that manifests in the structural alignment problem. 

Finally, as a proof of concept, we have also developed a quad- 
ratic-time dynamic programming heuristic to search for the op- 
timum /-value alignment. Currently this alignment heuristic is 
restricted to the rigid (and not flexible) model of computing 
the component message length I(T\S, A). Details of the heuristic 
search are beyond the scope of the current article, given the page 
limitations. Our implementation of this alignment heuristic based 
on /-value is available for use from our website. 

4 CONCLUSIONS 

The importance of finding biologically meaningful structural 
alignments has led to the intensive development of methods for 



generating alignments and evaluating their quality. However, 
these methods produce conflicting results and none has been 
generally accepted as clearly superior. 

Here we have described a measure of alignment quality, 
/-value, that uses the information content of messages that loss- 
lessly compress the Ca atoms of a pair of protein structures, 
given a proposed alignment. A lower /-value signifies a superior 
alignment. The method contains no adjustable parameters, as it is 
built on a formal Bayesian principle of minimum message length 
inference. 

Examination of competing alignments over many pairs of pro- 
tein structures demonstrates that /-value can accurately distin- 
guish between competing structural alignments in cases in which 
other methods either cannot significantly distinguish the quality 
of these possibilities, or do not agree in the selection of a best 
one. 
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